From genes to protein structure and function:
novel applications of computational 
approaches in the genomic era 

Jeffrey Skolnick and Jacquelyn S. Fetrow

The genome-sequencing projects are providing a detailed 'parts list' of life. A key to comprehending this list is understanding
the function of each gene and each protein at various levels. Sequence-based methods for function prediction are inadequate 
because of the multifunctional nature of proteins. However, just knowing the structure of the protein is also insufficient for 
prediction of multiple functional sites. Structural descriptors for protein functional sites are crucial for unlocking the secrets 
in both the sequence and structural-genomics projects. 



Genome-sequencing projects are providing a 
detailed 'parts list' for life. Unfortunately, this list, 
a portion of which represents the amino acid 
sequence of all the proteins in a given genome, does 
not come with an instruction manual. That is, given
the genome's sequences, one does not necessarily know 
straight away which regions encode proteins, which 
serve a regulatory role and which are responsible for 
the structure and replication of the DNA itself. 

This is not unlike giving a child a list of parts nec- 
essary to create a working automobile. Without the 
necessary expertise, creating the final, working car from 
just the initial parts list is a nearly impossible task. Similarly,
understanding how to create a complete, functioning
cell given just the sequence of nucleotides
found in an organism's genome is a complex problem.

What is a protein function? 

After a genome is sequenced and its complete parts 
list determined, the next goal is to understand the function(s)
of each part, including that of the proteins. What
do we mean by protein function, the focus of this article? 

Function has many meanings. At one level, the pro- 
tein could be a globular protein, such as an enzyme, 
hormone or antibody, or it could be a structural or 
membrane-bound protein. Another level is its bio- 
chemical function, such as the chemical reaction and 
the substrate specificity of an enzyme. The regulatory 
molecules or cofactors that bind to a protein are also 
levels of biochemical function. 

At the cellular level, the protein's function would 
involve its interaction with other macromolecules and 
the function and cellular location of such complexes. 
There is also the protein's physiological function; that 
is, in which metabolic pathway the protein is involved 
or what physiological role it performs in the organism. 
Finally, the phenotypic function is the role played by 
the protein in the total organism, which is observed by 
deleting or mutating the gene encoding the protein. 
Obviously, the complete characterization of protein
function is difficult, but efforts are under way at all levels 1-4,
including cellular function 5,6. In this article, however,
we focus on identifying the biochemical function of a 
protein given its sequence, a problem that is amenable to 
molecular approaches. 

Sequence-based approaches to function 
prediction 

The sequence-to-function approach is the most com- 
monly used function-prediction method. This robust 
field is well developed and, in the interest of space 
limitations, we will merely present a brief overview. 

There are two main flavors of this approach: sequence 
alignment 7-9 and sequence-motif methods such as
Prosite 10, Blocks 11,12, Prints 13 and Emotif 14. Both the
alignment and the motif methods are powerful but
recent analysis has demonstrated their significant limitations 15,
suggesting that these methods will increasingly
fail as the protein-sequence databases become more
diverse.

An extension of these approaches that combines 
protein-sequence with structural information has been 
developed and some successes have been reported 16 . 
However, this method still applies the structural information
in a one-dimensional, 'sequence-like' fashion
and fails to take into account the powerful three-dimensional
information displayed by protein structures.

In addition, proteins can gain and lose function dur- 
ing evolution and may, indeed, have multiple functions 
in the cell (Box 1). Sequence-to-function methods 
cannot specifically identify these complexities. Inaccu- 
rate use of sequence-to-function methods has led to 
significant function-annotation errors in the sequence 
databases 17 . 

An alternative approach 

An alternative, complementary approach to protein- 
function prediction uses the sequence-to-structure-to- 
function paradigm. Here, the goal is to determine the 
structure of the protein of interest and then to identify 
the functionally important residues in that structure. 
Using the chemical structure itself to identify functional 
sites is more in line with how the protein actually works.
In a sense, this is one long-term goal of 'structural
genomics' projects 18,19, which are designed to determine
all possible protein folds experimentally, just
as genome-sequencing projects are determining all 
protein sequences 20 . This is in contrast to traditional 
structural-biology approaches, in which one knows the 
protein's function first and only then, if the function is 
sufficiently important, determines its structure. 

It is implicitly assumed that having the protein's structure
will provide insights into its function, thereby furthering
the goals of the human-genome-sequencing
project. However, knowing a protein's three-dimensional 
structure is insufficient to determine its function 
(Box 2). What we really need to analyse and predict the 
multifunctional aspects of proteins is a method spe- 
cifically to recognize active sites and binding regions in 
these protein structures. 

Active-site identification 

In order to use a structure-based approach to function 
prediction, one must identify the key residues respon- 
sible for a given biochemical activity. For many years, 
it has been suggested that the active sites in proteins are 
better conserved than the overall fold. Taken to the 
limit, this suggests that one could not only identify dis- 
tant ancestors with the same global fold and the same 
activity but also proteins with similar functions but 
distantly related, or possibly unrelated, global folds. 

The validity of this suggestion was demonstrated 
empirically by Nussinov and co-workers, who showed
that the active sites of eukaryotic serine proteases, subtilisins
and sulfhydryl proteases exhibit similar structural
motifs 21. Furthermore, in a recent modeling study of
Saccharomyces cerevisiae proteins, protein functional sites
were found to be more conserved than other parts of
the protein models 22. Similarly, it has been demonstrated
that the catalytic triad of the α/β hydrolases
is structurally better conserved than other histidine-containing
triads 23. A comparison of the structure of the
hydrolase catalytic triad to other histidine-containing
triads shows a distinct bimodal distribution, while a
similar analysis done with a randomly selected triad shows
a unimodal distribution (Fig. 1).

Kasuya and Thornton 24 generalized this example by 
creating structural analogs of a few Prosite sequence 
motifs 10. For the 20 most frequently occurring Prosite
patterns, the associated local structure is quite distinct. 
These results provide clear evidence that enzyme active 
sites are indeed more highly conserved than other parts 
of the protein. 

Identifying active sites in experimental structures 

Historically, several groups have attempted to iden- 
tify functional sites in proteins; these efforts were 
directed at protein engineering or building functional 
sites in places where they did not previously exist. This 
has been successfully accomplished for several metal-binding
sites 25-33. However, highly accurate functional-site
descriptors of the backbone and side-chain atoms were
required, fueling the belief that significant atomic detail 
is required in site descriptors for function identification. 

Highly detailed residue side-chain descriptors of the 
active sites of serine proteases and related proteins have 
been used to identify functional sites 3. The use of these
highly detailed motifs has led to the identification of 



Box 1. Proteins are multifunctional 



A common protein characteristic that makes functional analysis based 
only on homology especially difficult is the tendency of proteins to be 
multifunctional. For instance, lactate dehydrogenase binds NAD, sub- 
strate and zinc, and performs a redox reaction. Each of these occurs 
at different functional sites that are in close proximity and the combi- 
nation of all four sites creates the fully functional protein. 

Other examples of multifunctional proteins are the nucleic-acid-binding 
proteins. For instance, DNA regulatory proteins often contain a DNA- 
binding domain, a multimerization domain and additional sites that bind 
regulatory proteins; a classic example is RecA 69 . The 3C rhinovirus 
protease exhibits a proteolytic function as well as an RNA-binding 
function. Transcription factors are also complex, multifunctional
proteins 62 . It is becoming increasingly important to recognize each of 
these different functions of gene products of a newly sequenced gene. 

The serine-threonine-phosphatase superfamily is a prime example of 
the difficulties of using standard sequence analysis to recognize the 
multiple functions found in single proteins. This large protein family is 
divided into a number of subfamilies, all of which contain an essential 
phosphatase active site. Subfamilies 1, 2A and 2B exhibit 40% or more 
sequence identity between them 63. However, each of these subfamilies
is apparently regulated differently in the cell 64-67 and observation suggests
that there are different functional sites at which regulation can
occur. Because the sequence identity between subfamilies is so high, 
standard sequence-similarity methods could easily misclassify new 
sequences as members of the wrong subfamily if the functional sites 
are not carefully considered, as was recently demonstrated 43 . 

These are but a few examples of the multifunctionality of proteins.
The recognition of this multifunctional nature is of critical importance 
to the genomics field. Useful functional-annotation methods must con- 
sider all of the specific functions in a given protein and will not just 
provide a general classification of function. 



several novel functional sites in known, high-quality
protein structures 3,34. More automated methods for
finding spatial motifs in protein structures have also
been described 3,34-40.

Unfortunately, most of these methods require the 
exact placement of atoms within protein backbones and
side chains, and so have not been shown to be relevant
to inexact predicted structures. Recently, however, we
described the production of fuzzy, inexact descriptors
of protein functional sites 15. As we wish to apply the
descriptors to experimental structures as well as to predicted
protein models, we used only α-carbon atoms and
side-chain center-of-mass positions. We call these
descriptors 'fuzzy functional forms' (FFFs) and have
created them for both the disulfide-oxidoreductase 15,41
and α/β-hydrolase catalytic active sites 23.
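
The idea of an inexact, geometry-based descriptor can be illustrated with a minimal sketch. It is not the published FFF: the descriptor layout, the function name `find_site_matches` and every distance range below are hypothetical placeholders chosen for illustration, and side-chain centers of mass are omitted for brevity. The sketch only shows how a small set of residue types plus tolerant α-carbon distance ranges can be searched for in a structure.

```python
import math
from itertools import permutations

# Hypothetical descriptor: residue types plus allowed alpha-carbon distance
# ranges in angstroms. The numbers are illustrative placeholders only and
# are NOT the parameters of any published fuzzy functional form.
EXAMPLE_DESCRIPTOR = {
    "residues": ("CYS", "CYS", "PRO"),
    "distance_ranges": {
        (0, 1): (4.5, 7.0),
        (0, 2): (3.5, 6.5),
        (1, 2): (4.0, 8.0),
    },
}


def find_site_matches(structure, descriptor):
    """Return index tuples of residues that satisfy the descriptor.

    structure: list of (residue_name, (x, y, z)) pairs, one per residue,
    where the coordinates are alpha-carbon positions.
    """
    names = descriptor["residues"]
    ranges = descriptor["distance_ranges"]
    matches = []
    # Brute-force search over ordered residue tuples; adequate for a sketch.
    for combo in permutations(range(len(structure)), len(names)):
        # Residue types must agree with the descriptor, position by position.
        if any(structure[i][0] != name for i, name in zip(combo, names)):
            continue
        # Every pairwise distance must fall inside its tolerant range.
        if all(
            low <= math.dist(structure[combo[i]][1], structure[combo[j]][1]) <= high
            for (i, j), (low, high) in ranges.items()
        ):
            matches.append(combo)
    return matches
```

Because the constraints are ranges rather than exact coordinates, the same descriptor can in principle be applied both to a crystal structure and to a cruder model; how wide the ranges can be made before false matches appear is an empirical question that this sketch does not address.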

The disulfide-oxidoreductase FFF was applied to 
screen high-resolution structures from the Brookhaven
protein database 42. In a dataset of 364 protein structures,
the FFF accurately identified all proteins known to
exhibit the disulfide-oxidoreductase active site 15. In a
larger dataset of 1501 proteins, the FFF again accurately
identified all proteins with the active site. In addition,
it identified another protein, 1fjm, a serine-threonine
phosphatase. This result was initially discouraging but
subsequent sequence alignment and clustering analysis
strongly suggested that this putative site might indeed
be a site of redox regulation in the serine-threonine
phosphatase-1 subfamily 43. If confirmed by experiment,
this result will highlight the advantages of using structural
descriptors to analyse multiple functional sites in
proteins. It will also highlight the fact that human 

Box 2. Knowing a protein's structure does not necessarily
tell you its function 



Because proteins can have similar folds but different functions 68,69,
determining the structure of a protein may or may not tell you something
about its function. The most well-studied example is the (α/β)8
barrel enzymes, of which triose-phosphate isomerase (TIM) is the archetypal
representative. Members of this family have similar overall structures
but different functions, including different active sites, substrate
specificities and cofactor requirements 70,71.

Is this example common? Our own analysis of the 1997 SCOP database 68
shows that the five largest fold families are the ferredoxin-like,
the (α/β) barrels, the knottins, the immunoglobulin-like and the
flavodoxin-like fold families with 22, 18, 13, 9 and 9 subfamilies, respectively
(Fig. I). In fact, 57 of the SCOP fold families consist of multiple
superfamilies. These data only show the tip of the iceberg, because
each superfamily is further composed of protein families and each individual
family can have radically different functions. For example, the
ferredoxin-like superfamily contains families identified as Fe-S ferredoxins,
ribosomal proteins, DNA-binding proteins and phosphatases, among 
others. 

After this article was submitted, a much-more-detailed analysis of the
SCOP database was published 72 . This finds a broad function-structure 
correlation for some structural classes, but also finds a number of 
ubiquitous functions and structures that occur across a number of fam- 
ilies. The article provides a useful analysis of the confidence with which 
structure and function can be correlated 72 . Knowing the protein struc- 
ture by itself is insufficient to annotate a number of functional classes 
and is also insufficient for annotating the specific details of protein 
function. 

Figure I

Histogram of the numbers of superfamilies found in each SCOP fold family. 
These data clearly show that proteins with similar structures can have different 
functions and demonstrate the difficulty of assigning protein function based 
simply on the three-dimensional structure. The data were taken from the 1997 
distribution of SCOP (http://scop.mrc-lmb.cam.ac.uk/scop). For a more-detailed 
analysis, see Ref. 72.



observation alone is no longer adequate for identifying 
all functional sites in known protein structures. 

To date, the use of structure to identify function has 
largely focused on high-resolution structures and highly 
detailed descriptors of protein functional sites. How- 
ever, the creation of inexact descriptors for functional 
sites opens the way to the application of these methods 
to inexact, predicted protein models. The question 
remains: how good does a model have to be in order 
to use FFFs to identify its active sites?



The state of the art in structure-prediction 
methods 

For proteins whose sequence identity to a protein of known structure is above ~30%,
one can use homology modeling to build the structure 44.
However, structure prediction is far more difficult
for proteins that are not homologous to proteins with
known structure. At present, there are two approaches for
these sequences: ab initio folding 45-48 and threading 49-53.

In ab initio folding, one starts from a random confor- 
mation and then attempts to assemble the native struc- 
ture. As this method does not rely on a library of 
pre-existing folds, it can be used to predict novel 
folds. The recent CASP3 protein-structure-prediction 
experiment (http://PredictionCenter.llnl.gov/CASP3)
involved the blind prediction of the structure of proteins
whose actual structure was about to be experimentally
determined. These results indicate that considerable
progress has been made 46,54. For helical and
α/β proteins with fewer than 110 residues, structures
were often predicted whose backbone root-mean-square
deviation (RMSD) from native ranged from
4-7 Å. Progress is being made with the β proteins, too,
although they remain problematic. Because ab initio 
methods can identify novel folds, these methods could 
be used to help to select sequences likely to yield novel 
folds in experimental structural-genomics projects. 
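
For readers unfamiliar with the RMSD figures quoted above, the following minimal sketch shows the quantity being reported: the square root of the mean squared distance between equivalent atoms in the model and the native structure. It assumes the two coordinate sets have already been optimally superposed and that atoms are listed in the same order; the superposition step (for example, a least-squares rigid-body fit) is deliberately left out, and the function name `rmsd` is our own.

```python
import math


def rmsd(model_coords, native_coords):
    """Root-mean-square deviation between two pre-superposed coordinate sets.

    Both arguments are equal-length lists of (x, y, z) tuples in angstroms,
    with atom i of the model corresponding to atom i of the native structure.
    No superposition is performed here; that is assumed to be done already.
    """
    if len(model_coords) != len(native_coords) or not model_coords:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = sum(
        (xm - xn) ** 2 + (ym - yn) ** 2 + (zm - zn) ** 2
        for (xm, ym, zm), (xn, yn, zn) in zip(model_coords, native_coords)
    )
    return math.sqrt(squared_sum / len(model_coords))
```

A backbone RMSD of 4-7 Å therefore means that, on average, corresponding backbone atoms in the model lie several ångströms from their native positions, which is why descriptors that demand exact atomic placement are poorly suited to such models.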

Another approach to tertiary-structure prediction is 
threading. Here, for the sequence of interest, one 
attempts to find the closest matching structure in a 
library of known folds 52,55. Threading is applicable to
proteins of up to 500 residues or so and is much faster 
than ab initio approaches. However, threading cannot 
be used to obtain novel folds. 

Ab initio predicted models can be used for automatic 
protein-function prediction 

The results of the recent CASP3 competition sug- 
gest that current modeling methods can often (but not 
always) create inexact protein models. Are these struc- 
tures useful for identifying functional sites in proteins? 
Using the ab initio structure-prediction program 
MONSSTER, the tertiary structure of a glutaredoxin, 
1ego, was predicted 56. For the lowest-energy model,
the overall backbone RMSD from the crystal structure
was 5.7 Å.

To determine whether this inexact model could be 
used for function identification, the sets of correctly 
and incorrectly folded structures were screened with
the FFF for disulfide-oxidoreductase activity 15. The
FFF uniquely identified the active site in the correctly
folded structure but not in the incorrectly folded ones
(Fig. 2). This is a proof-of-principle demonstration that 
inexact models produced by ab initio prediction of 
structure from sequence can be used for the subsequent 
prediction of biochemical function. Of course, improve- 
ments in the method have to be made before such 
predictions can be done on a routine basis. 

Use of predicted structures from threading in 
protein-function prediction 

At present, practical limitations preclude folding an 
entire genome of proteins using ab initio methods 57 . 
Threading is more appropriate for achieving the requisite 
high-throughput structure prediction. Thus, a standard
threading algorithm 58 has been used to screen all
proteins in nine genomes for the disulfide-oxidoreductase
active site described above. 

First, sequences that aligned with the structures of 
known disulfide oxidoreductases were identified. Then, 
the structure was searched for matches to the active- 
site residues and geometry. For those sequences for 
which other homologs were available, a sequence-conservation
profile was constructed 23. If the putative
active-site residues were not conserved in the sequence
subfamily to which the protein belongs, that sequence
was eliminated. Otherwise, the sequence was predicted
to have the function.
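
The three filtering steps just described can be sketched as follows. This is an illustration of the decision logic only, not the authors' implementation: the threading and geometry-matching steps are represented by their outcomes (`threading_hit` and `site_columns`), and the conservation measure, the function names and the 0.8 cutoff are all assumptions made for this example.

```python
from collections import Counter


def column_conservation(alignment, column):
    """Fraction of aligned sequences sharing the most common residue in a column."""
    residues = [sequence[column] for sequence in alignment]
    most_common_count = Counter(residues).most_common(1)[0][1]
    return most_common_count / len(residues)


def predict_has_function(threading_hit, site_columns, alignment, min_conservation=0.8):
    """Apply the three filters in order and return True if the function is predicted.

    threading_hit: True if the sequence aligned to a known structure (step 1).
    site_columns: alignment columns of the putative active-site residues,
        or None if the active-site geometry was not matched (step 2).
    alignment: list of equal-length aligned sequences, query first; a
        single-element list means no homologs were available (step 3 skipped).
    min_conservation: hypothetical cutoff, not a value from the article.
    """
    if not threading_hit:
        return False
    if site_columns is None:
        return False
    if len(alignment) > 1:
        # Reject the prediction if any putative site residue is poorly conserved.
        for column in site_columns:
            if column_conservation(alignment, column) < min_conservation:
                return False
    return True
```

For instance, with the toy alignment `["ACPYC", "ACPFC", "TCPYC"]` and site columns `[1, 2, 4]`, every site column is fully conserved and the sequence is retained, whereas a subfamily in which the first cysteine varies would be eliminated.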

Using this sequence-to-structure-to-function method,
99% of the proteins in the nine genomes that have
known disulfide-oxidoreductase activity have been
found. From 10% to 30% more functional predictions
are made than by alternative sequence-based approaches;
similar results are seen for the α/β hydrolases 23. Surprisingly,
in spite of the fact that threading algorithms
have problems generating good sequence-to-structure
alignments, active sites are often accurately aligned,
even for very distant matches. This observation would
agree with the above experimental results indicating 
that active sites are well conserved in protein structures. 

Importantly, the false-positive rate when using structural
information is much lower than that found using
sequence-based approaches, as demonstrated by a
detailed comparison of the FFF structural approach and
the Blocks sequence-motif approach (N. Siew et al.,
unpublished). In this study, the sequences in eight
genomes, including Bacillus subtilis, were analysed for
disulfide-oxidoreductase function using the disulfide-oxidoreductase
FFF, the thioredoxin Block 00194 and
the glutaredoxin Block 00195. If we assume that those
sequences identified by both the FFF and Blocks 
are 'true positives', we find 13 such sequences in the 
B. subtilis genome. 

There is no experimental evidence validating all of 
these 'true positives' and so they are more accurately 
termed 'consensus positives'. In order to find these 13 
'consensus positive' sequences, the FFF hits seven false 
positives. On the other hand, Blocks hits 23 false 
positives (Fig. 3). It was previously suggested that the 
use of a functional requirement adds information to 
threading and reduces the number of false positives 53 . 
These data, including the data shown in Fig. 3, validate
this claim on a genome-wide basis. 
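
The B. subtilis counts quoted above can be turned into a simple comparison. The short calculation below treats the 13 'consensus positives' as if they were true positives, which, as noted, has not been established experimentally; the function name `consensus_precision` is our own, and the result should be read as the fraction of each method's hits that are consensus positives rather than as a validated precision.

```python
def consensus_precision(consensus_positives, false_positives):
    """Fraction of a method's hits that are consensus positives."""
    return consensus_positives / (consensus_positives + false_positives)


# Counts taken from the text: 13 consensus positives in B. subtilis,
# with 7 additional hits for the FFF and 23 for Blocks.
fff_fraction = consensus_precision(13, 7)
blocks_fraction = consensus_precision(13, 23)
```

On these numbers, 13 of 20 FFF hits (65%) are consensus positives, compared with 13 of 36 Blocks hits (about 36%).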

Of course, as no genome has had the function of all 
of its proteins experimentally annotated, it is imposs- 
ible to know how many other proteins with the speci- 
fied biochemical function were not properly identified. 
This is a critical question for researchers attempting to 
predict protein function. Experimental confirmation 
will be needed to validate this or any other method
fully. This points out the need for closely coupling
computational function-prediction algorithms with 
experiments. 

Weaknesses of using the sequence-to-structure- 
to-function method of function prediction 

Based on studies to date, the identification of enzy- 
matic activity requires a model in which the backbone 
RMSD from native near the active sites is about 4-5 Å.
Predicted models are better at describing the geometry 
in the core of the molecule than in the loops and so 

Figure 1

The distribution of root-mean-square deviations (RMSD) between the hydrolase
catalytic triad and all other histidine-containing triads shows a bimodal distribution
(a); by contrast, the RMSD between a randomly selected (non-catalytic) triad and all
other histidine-containing triads has a unimodal distribution (b). The His-Ser-Asp
catalytic triad in the protein 1gpl (RP2 lipase) (a) and a random histidine-containing
triad from 4pga (glutaminase-asparaginase) (b) were structurally aligned to all His-containing
triads in a database of 1037 proteins 23. Actual α/β-hydrolase active sites
(a) and the 4pga site (b) are indicated by blue bars; other histidine triads that are
not active sites are indicated by red bars. None of the sites found by matching to the
4pga were hydrolase active sites. Inset graphs show the full distribution.

predicting the function of a protein whose active site is 
in loops may be a problem. Also, the method can currently
only be applied to enzyme active sites; substrate-
and ligand-binding sites have not been identified using
the inexact models. Techniques that will further refine 
inexact protein models will be quite useful in taking 
the protein analysis to the next step. 

Conclusions 

Although sequence-based approaches to protein- 
function prediction have proved to be very useful, alter- 
natives are needed to assign the biochemical function 
of the 30-50% of proteins whose function cannot be 
assigned by any current methods. One emerging 
approach involves the sequence-to-structure-to-function 
paradigm. Such structures might be provided by struc- 
tural-genomics projects or by structure-prediction 
algorithms. Functional assignment is made by screen- 
ing the resulting structure against a library of structural 
descriptors for known active sites or binding regions. 

Figure 2

Application of the disulfide-oxidoreductase fuzzy functional form (FFF) to ab initio
models of glutaredoxin created by the program MONSSTER shows that the FFF can
distinguish between correctly folded and misfolded (or higher-energy) models. The FFF
is shown as two orange balls (representing the cysteines) and a blue ball (representing
the proline). The protein models are shown as magenta wire models with the active-site
cysteines and proline shown as yellow and cyan balls, respectively. The FFF clearly
distinguishes the correct active site in the crystal structure of the glutaredoxin 1ego
and the correctly folded, lowest-energy model. The FFF does not match to the active 
sites of any of the higher energy, misfolded structures, four of which are shown here. 

Figure 3

Analysis of the Bacillus subtilis genome using the thioredoxin Block 00194. The Blocks
score (computed using the publicly available BLIMPS program) is plotted on the x axis
and the number of sequences found in each scoring bin is plotted on the y axis. Those
sequences identified as 'consensus positives' (identified by both the fuzzy functional
form (FFF) and the Block) are shown as red bars. One additional sequence found by
the FFF, which is likely to be a true positive, is shown as a blue bar. All other
sequences, putative 'false positives', are shown as yellow bars. Using the Blocks
score at which all 13 of the 'consensus positives' are found, 23 false positives are
also found. In its analysis of the B. subtilis genome, the FFF identifies only seven false
positives along with the same 13 'consensus positives' (data not shown).

Detailed descriptors will only work on the experi- 
mentally determined, high-quality structures. Ideally, 
however, the descriptors should work on both experi- 
mental structures and the cruder models provided by 
tertiary-structure-prediction algorithms.

The advantages of such an approach are that one need 
not establish an evolutionary relationship in order to 
assign function, that more than one function can be 
assigned to a given protein [an issue of major importance,
because proteins are multifunctional (Box 1)]
and, ultimately, that having a structure can provide 
deeper insight into the biological mechanism of pro- 
tein function and regulation. The disadvantages are that 
one needs to have the protein's structure before a func- 
tion can be assigned and that the approach is limited to 
those functions associated with proteins with at least 
one solved structure, so that a functional-site descriptor 
can be constructed. 

In this sense, structure-to-function assignment can be
thought of as 'functional threading': find the active-site
match in a library of descriptors for known protein
active sites. This is the first step in the long process of 
using structure to assign all levels of function, a goal 
that is made increasingly important with the emergence 
of structural genomics. Based on the progress to date, 
it is apparent that structure will play an important role 
in the post-genomic era of biology. 